KEP-3695: extend pod resource API for DRA #3738
Conversation
Hi @moshe010. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
/assign
/cc
This PR seems to be missing the required file. Just to make sure, did you also follow the latest template for the KEP itself?
I'm generally fine with the content of the KEP here; I would just like to have it documented better why DRA can't provide data useful to GetAllocatableResources, and whether this status could change in the future. A few lines are sufficient.
To enhance the pod resource API GetAllocatableResources for resources that are managed by DRA. With DRA there is no standard way to get resource capacity.
non-blocking: I understand we can't provide something that DRA itself is not providing, but making the API more irregular makes me pause. Can we try to list options here to sketch some plans for the future? It would also be nice to spend a few lines explaining why DRA can't provide this information.
Maybe this issue of a common way to get resource capacity with DRA should be tracked as a separate issue?
@klueska maybe we can standardize the nodeallocationstates CRD https://gitlab.com/nvidia/cloud-native/k8s-dra-driver/-/blob/main/deployments/static/crds/gpu.resource.nvidia.com_nodeallocationstates.yaml as the CRD for the DRA kubelet plugin?
Not sure how we would (or even could) provide Capacity or Allocatable for DRA managed resources.
To answer your question directly, I don't think it's reasonable to assume there will be a standard representation for each resource managed by a DRA resource driver. It's not even required that a CRD be used for this purpose, let alone adhere to a specific format.
Besides that, the notion of a "count" for a DRA managed resource is itself not well defined, since each resource that gets allocated can take a drastically different form depending on what ClaimParameter objects are used to request it.
Made it about a third of the way through and need to take a break. Will come back to it soon.
I wasn't aware of it. Who should I put as the approver? Should I just pick one from the list?
Yes, please pick one. The PRR reviewers will rebalance the load internally and suggest a replacement should the one you picked turn out to be overloaded.
Our proposal is to extend the existing PodResources gRPC service of the Kubelet with a repeated DynamicResource field in the ContainerResources message. This new field will contain information about the DRA resource class, the DRA resource claim and a list of CDI Devices allocated by a DRA driver.
Additionally, we propose adding a Get method to the existing gRPC service to allow querying specific pods for their allocated resources.
The extended interface is shown in proto below:
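The proto definition itself is not reproduced in this excerpt. As a rough, non-authoritative sketch of what the extension described above could look like, here is a Go-side view; all type, field, and method names are assumptions drawn from the description, not the actual generated kubelet API:

```go
// Package sketch: a hypothetical Go-side view of the proposed
// PodResources extension. All names are assumptions based on the KEP
// text above, not the authoritative kubelet API.
package sketch

// CDIDevice identifies a single CDI device by its fully qualified name,
// e.g. "vendor.example.com/gpu=gpu0".
type CDIDevice struct {
	Name string
}

// DynamicResource describes one DRA allocation attached to a container:
// the resource class, the resource claim, and the CDI devices the DRA
// driver prepared for that claim.
type DynamicResource struct {
	ClassName      string
	ClaimName      string
	ClaimNamespace string
	CDIDevices     []CDIDevice
}

// ContainerResources would gain a new repeated DynamicResources field
// next to the existing device-plugin and CPU fields (omitted here).
type ContainerResources struct {
	Name             string
	DynamicResources []DynamicResource
}

// PodResources groups the per-container information for one pod.
type PodResources struct {
	Name       string
	Namespace  string
	Containers []ContainerResources
}

// PodResourcesLister sketches the service surface: List and
// GetAllocatableResources already exist today (return types simplified
// here); Get is the proposed addition for querying a single pod.
type PodResourcesLister interface {
	List() ([]PodResources, error)
	GetAllocatableResources() ([]ContainerResources, error)
	Get(podName, podNamespace string) (PodResources, error)
}
```

The existing List and GetAllocatableResources calls are shown only to situate the proposed Get method; their real signatures differ.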
I understand the proposed API is read-only and mainly for monitoring, but I'm not sure about the entire flow. May I have a couple of questions answered before approving this KEP:
- We are using polling here. But what if the DRA driver is dead, does the pod using the resource continue running? How do we make sure the stats are up to date?
- DRA might be deallocated on the node; how often does the kubelet issue Get() calls to reconcile the status?
- Do we need registration logic with the kubelet?
I guess some of the questions might be discussed in a separate KEP, but I hope we have answers to these high-level questions. Thanks!
For (1):
I guess you could think of it as polling, but the PodResources API is really more of an on-demand API that goes and gets the information it needs at the time it is requested. Nothing is cached as part of the API itself. The various components that hold the information (CPUManager, DeviceManager, etc) may cache this information, but the API doesn't "store" this information anywhere by itself.
In the case of the DRA manager, the information required to populate the new fields in the PodResources API is all cached in an in-memory store for each container that gets launched. However, unlike the CPUManager and DeviceManager, this information is not persisted / checkpointed by the DRA manager itself (meaning the cache cannot be directly repopulated across a kubelet restart). It is idempotently queryable from each DRA driver, though, meaning that the in-memory cache can be used to populate the API if the information is available; otherwise the cache can be filled by querying the DRA driver and then pushed up to the API as needed.
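To make that behaviour concrete, here is a minimal sketch of the "serve from the in-memory cache, otherwise re-query the driver" pattern; ClaimInfo, Manager, and DriverClient are hypothetical stand-ins, not the real kubelet DRA manager types:

```go
// Hypothetical sketch of the lookup pattern described above. The types
// and the driver call are illustrative stand-ins, not the real kubelet
// DRA manager implementation.
package sketch

import "fmt"

// ClaimInfo holds what the PodResources API needs for one resource claim.
type ClaimInfo struct {
	ClassName  string
	ClaimName  string
	CDIDevices []string
}

// DriverClient abstracts the idempotent query against a DRA driver
// (conceptually the NodePrepareResources call mentioned below).
type DriverClient interface {
	QueryClaim(claimUID string) (ClaimInfo, error)
}

// Manager keeps an in-memory cache that is not checkpointed, so it is
// empty after a kubelet restart until it is refilled from the drivers.
type Manager struct {
	cache   map[string]ClaimInfo
	drivers map[string]DriverClient
}

// ClaimInfoFor serves from the cache when possible and otherwise refills
// it by asking the owning driver, which is safe because the driver query
// is idempotent.
func (m *Manager) ClaimInfoFor(driverName, claimUID string) (ClaimInfo, error) {
	if info, ok := m.cache[claimUID]; ok {
		return info, nil
	}
	driver, ok := m.drivers[driverName]
	if !ok {
		return ClaimInfo{}, fmt.Errorf("unknown DRA driver %q", driverName)
	}
	info, err := driver.QueryClaim(claimUID)
	if err != nil {
		return ClaimInfo{}, err
	}
	m.cache[claimUID] = info
	return info, nil
}
```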
In response to this specific question: "But what if the DRA driver is dead, does the pod using the resource continue running?"
The lifetime of a pod is not (necessarily) tied to the "readiness" of the DRA drivers that manage its resource claims. The only time a pod is blocked from deletion by a claim not being deleted (possibly because the DRA driver managing it is down) is if it is the last pod referencing that claim.
So I think what you are asking is "If a pod is running, but I can't communicate with one of the DRA drivers to query its list of DynamicResources (and this information is not available in the DRA manager's in-memory cache), what do I do? Do I omit this information? Do I return an error? etc."
Does this accurately reflect what you are trying to ask?
For (2):
What do you mean by "DRA might be deallocated on the node"? Do you mean the resources associated with a specific DRA driver might be deallocated on a node? And the question is how often the kubelet will need to consult with the DRA driver to refresh whatever information it may have about these resources? I think I answered this in my response to (1) when talking about the in-memory cache of the DRA manager and it ability to idempotently retrieve any missing information from the DRA driver.
The Get() call you refer to is something exposed by the kubelet (not something the kubelet calls itself), so I assume you actually meant NodePrepareResources() (which is the idempotent call a kubelet can make against a DRA driver to get the list of CDI devices associated with a claim).
For (3):
What are you proposing should have registration logic with the kubelet? DRA resource drivers already do register with the kubelet (so it can make calls against them to Prepare/Unprepare resources). I think I need more context here to answer the question properly.
@dchen1107 I have updated the Proposed API section with more details that hopefully address your comments / concerns. PTAL
@klueska Thanks for the detailed response.
You answered my questions. I was trying to figure out how idempotent the behavior is since DRA is dynamic, and how persistent and consistent the status is across kubelet and/or DRA driver restarts. Even if we might still have some inconsistency across a restart, it is expected to some extent. I am accepting it for now.
Also, the KEP heavily depends on DRA, but it doesn't include a high-level component diagram capturing the dependencies and interfaces; it assumes the reviewers have all the information. My original wish was that, in answering my questions, we could also address the issue for future readers. I guess I didn't state it clearly. I am ok with it for now.
No problem. They were important questions to ask.
Regarding:
Even if we might still have some inconsistency across a restart, it is expected to some extent.
I don't think we will have any inconsistencies. The PodResources API is provided by the kubelet itself, so if it is down the API is inaccessible anyway. And once it comes back up, it will reconcile all of the info it needs before it starts serving the API again.
Will there be any race if the API is called too early, before everything got registered?
As for plugins not responding - there will be no partial results, correct? Either all known information is returned or an error. So one plugin can be a "noisy" neighbor for the API. Should this be added as a risk?
Will there be any race if the API is called too early, before everything got registered?
There shouldn't be. The set of CDI devices to allocate to a container is pulled from the various DRA plugins before a container is ever created. Either all plugins successfully allocate devices (thus populating the cache) and the container eventually gets created and starts running, or some plugin fails and the container start fails as well, meaning only containers whose cached CDI device info is complete will ever be created and thus be accessible through the API.
As for plugins not responding - there will be no partial results, correct? Either all known information is returned or an error. So one plugin can be a "noisy" neighbor for the API. Should this be added as a risk?
That shouldn't matter. If a plugin does not respond, then the container will never be started and thus will not be something queryable through the PodResources API.
I have a couple of PRR questions. Since this proposal depends heavily on DRA, I guess we should carefully write down the differences/dependencies here, which is what the questions are all about.
/cc @johnbelamaric
###### Does this feature depend on any specific services running in the cluster?

No.
Does it require DRA to be working?
It requires DRA to be available, but DRA is not a service (it is something built into the kubelet / default scheduler). Simply ensuring that the DRA feature gate is enabled is enough to ensure that DRA is available, so I tend to agree that this does not depend on any services running in the cluster (as stated currently).
I updated this section to be the same as in the DRA KEP: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/3063-dynamic-resource-allocation/README.md#dependencies
/lgtm
I'd like to see a few minor updates to the PRR answers, but overall the PRR looks good.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: dchen1107, johnbelamaric, klueska, moshe010.
owning-sig: sig-node
participating-sigs: []
status: provisional
creation-date: implementable
This field should be the date created. Right now it gives the misleading impression that the KEP is approved as "implementable", when the status is "provisional", and it is missing the date.
See e.g.: creation-date: 2023-01-05
It looks like something got screwed up here.
The date should definitely be fixed, but the status should, in fact, be implementable, not provisional (the code that implements this KEP was merged in kubernetes/kubernetes#115847).
### Goals

- To allow node monitoring agents to know the allocated DRA resources for Pods on a node.
- To allow the DRA feature to work with CNIs that require complex network devices such as RDMA. DRA resource drivers will allocate the resources, and the meta-plugin will read the allocated [CDI Devices](https://github.com/container-orchestrated-devices/container-device-interface) using the PodResources API. The meta-plugin will then inject the device-id of these CDI Devices as CNI arguments and invoke other CNIs (just as it does for devices allocated by the device plugin today).
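For illustration, a CNI meta-plugin or monitoring agent on the node might traverse the response of the proposed Get call roughly like this; the client interface and types are hypothetical stand-ins (mirroring the sketch shown earlier in this discussion), and the CDI device name in the comment is a made-up example:

```go
// Hypothetical consumer of the extended PodResources API. The client
// interface and types below are illustrative stand-ins, not the real
// generated gRPC client.
package sketch

type CDIDevice struct{ Name string }

type DynamicResource struct {
	ClassName      string
	ClaimName      string
	ClaimNamespace string
	CDIDevices     []CDIDevice
}

type ContainerResources struct {
	Name             string
	DynamicResources []DynamicResource
}

type PodResources struct {
	Name       string
	Namespace  string
	Containers []ContainerResources
}

// Client stands in for a gRPC client connected to the kubelet's
// pod-resources socket.
type Client interface {
	Get(podName, podNamespace string) (PodResources, error)
}

// cdiDeviceIDs collects the fully qualified CDI device names that a CNI
// meta-plugin would pass on as CNI arguments for the given pod
// (e.g. "vendor.example.com/net=rdma0").
func cdiDeviceIDs(c Client, podName, podNamespace string) ([]string, error) {
	pod, err := c.Get(podName, podNamespace)
	if err != nil {
		return nil, err
	}
	var ids []string
	for _, ctr := range pod.Containers {
		for _, dr := range ctr.DynamicResources {
			for _, dev := range dr.CDIDevices {
				ids = append(ids, dev.Name)
			}
		}
	}
	return ids, nil
}
```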
This approach was not discussed in the SIG Network and WG Multinetwork forums AFAIK; it would be better to involve all the stakeholders in these discussions to get a uniform solution.
The recent developments and consensus for the networking extension with DRA are described in #4861
#### Beta

- [ ] Gather feedback from consumers of the DRA feature and the k8snetworkplumbingwg working group
and SIG Network
Extend pod resource API for Dynamic Resource Allocation
DRA: Extend PodResources to include resources from Dynamic Resource Allocation #3695